# Install required packages
install.packages(c("AmesHousing", "kernlab", "pdp"))
Data on median housing values from 506 census tracts in the suburbs of Boston from the 1970 census. This data frame is a corrected version of the original data by Harrison and Rubinfeld (1978) with additional spatial information. The data are available in the pdp package which were taken directly from mlbench after the removal of unneeded columns (i.e., name of town, census tract, and the uncorrected median home value). See ?pdp::boston for details.
# Load the Boston housing data
data(boston, package = "pdp")
# View the Ames housing data
DT::datatable(boston, extensions = "FixedColumns",
options = list(
dom = "t",
scrollX = TRUE,
fixedColumns = TRUE
))
The AmesHousing package contains a processed version of the data described by De Cock (2011) where 82 fields were recorded for 2,930 properties in Ames IA. A description of each field can be found here. This is a more contemporary alternative to the Boston housing data described above.
# Load the Ames housing data
ames <- AmesHousing::make_ames()
# View the Ames housing data
DT::datatable(ames, extensions = "FixedColumns",
options = list(
dom = "t",
scrollX = TRUE,
fixedColumns = TRUE
))
There is no simple rule for determining the edibility of a mushroom; no rules like “leaflets three, let it be”, “hairy vine, no friend of mine” and “berries white, run in fright” for poison ivy. The following data, taken from the UCI Machine Learning Reposirtory contain 21 physical characteristics on 8,124 mushrooms classified as either poisonous or edible.
# Load the mushroom data
mushroom <- read.csv("https://bgreenwell.github.io/MLDay18/data/mushroom.csv")
# View the mushroom data
DT::datatable(mushroom, extensions = "FixedColumns",
options = list(
dom = "t",
scrollX = TRUE,
fixedColumns = TRUE
))
This is a data set collected at Hewlett-Packard Labs, that classifies 4,601 e-mails as spam or non-spam (i.e., “ham”). There are 2,788 non-spam e-mails and 1.813 spam e-mails. In addition to this class label there are 57 variables indicating the frequency of certain words and characters in the e-mail. The first 48 variables contain the frequency of the variable name (e.g., business) in the e-mail. If the variable name starts with “num” (e.g., num650) then it indicates the frequency of the corresponding number (e.g., 650). The variables 49–54 indicate the frequency of the characters ‘;’, ‘(’, ‘[’, ‘!’, ‘$’, and ‘#’. The variables 55–57 contain the average, longest and total run-length of capital letters. Variable 58 indicates the type of the mail and is either “nonspam” or “spam”, i.e. unsolicited commercial e-mail. See ?kernlab::spam for details.
The “spam” concept is diverse: advertisements for products/web sites, make money fast schemes, chain letters, pornography, etc. This collection of spam e-mails came from the collectors’ postmaster and individuals who had filed spam. The collection of non-spam e-mails came from filed work and personal e-mails, and hence the word ‘george’ and the area code ‘650’ are indicators of non-spam. These are useful when constructing a personalized spam filter. One would either have to blind such non-spam indicators or get a very wide collection of non-spam to generate a general purpose spam filter.
# Load the email spam data
data(spam, package = "kernlab")
# View the email spam data
DT::datatable(spam, extensions = "FixedColumns",
options = list(
dom = "t",
scrollX = TRUE,
fixedColumns = TRUE
))
## Warning in instance$preRenderHook(instance): It seems your data is too
## big for client-side DataTables. You may consider server-side processing:
## https://rstudio.github.io/DT/server.html